Import Necessary Packages and Libraries

Data Cleaning/EDA

Read in Data

Shorten Column Names

Quick EDA

Key Insights from EDA

Feature Engineering

Methodology

Based on the EDA above and our general project intent, we will engineer features that include log/sqrt transformations where appropriate, features representing the unique pairwise ratios of ingredients, and a categorical treatment of age through binning, along with interaction effects between binned age and the features above. Binning accounts for potential non-linearity in age while allowing the other slope/intercept coefficients to adjust under specific age conditions.

This feature engineering process effectively explodes our feature space into significantly higher dimensions. Our model development process will therefore rely heavily on methods that regularize models in order to preserve model parsimony. While the initial feature count is high, we anticipate that the final models will be significantly lower dimensional, as the model development frameworks identify the subset of truly significant features. Overall, this process will allow us to identify how individual components, ratios of components, and age ultimately impact concrete compressive strength.
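The feature engineering steps described above can be sketched as follows. This is a minimal illustration with hypothetical column names and bin edges (`cement`, `water`, `age`, bins at 28 and 180 days); the real notebook applies the same pattern to the full ingredient list from the dataset read in earlier.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy frame with hypothetical (shortened) column names
df = pd.DataFrame({
    "cement": [540.0, 332.5, 198.6],
    "water":  [162.0, 228.0, 192.0],
    "age":    [28, 90, 365],
})

ingredients = ["cement", "water"]

# Log, sqrt, and squared transformations of (strictly positive) ingredients
for col in ingredients:
    df[f"log_{col}"] = np.log(df[col])
    df[f"sqrt_{col}"] = np.sqrt(df[col])
    df[f"{col}_sq"] = df[col] ** 2

# Unique pairwise ratios among ingredients (non-age features)
for a, b in combinations(ingredients, 2):
    df[f"{a}_per_{b}"] = df[a] / df[b]

# Binned (categorical) age, then interactions of each age bin with the
# transformed/untransformed ingredient features and ratios
df["age_bin"] = pd.cut(df["age"], bins=[0, 28, 180, np.inf],
                       labels=["early", "mid", "late"])
age_dummies = pd.get_dummies(df["age_bin"], prefix="age", dtype=float)
base_features = [c for c in df.columns if c not in ("age", "age_bin")]
for dummy in age_dummies.columns:
    for feat in base_features:
        df[f"{dummy}_x_{feat}"] = age_dummies[dummy] * df[feat]
```

With just two ingredients this already produces 9 base features and 27 age interactions, which illustrates how the full ingredient list explodes into the high-dimensional feature space discussed above.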

Log/SQRT and ^2 Transformations

Pairwise ratio features among ingredients (non-age features)

Binned Treatment of Age along with interaction effects with individual ingredients (transformed and non-transformed) and ratios

Final Labeled Dataset

Feature Data Shape (rows, total number of features)

80-20 Train/Test Split into Training/Exploration Set and Test Set

Model Development

Methodology

We currently have 122 features, the majority of which are likely not significant for predicting strength. Our goal for the model development process is to use a variety of techniques to identify more parsimonious models: models built from the feature subsets that are truly significant, without features that add no value in a linear model. We will use the following 4 methods for identifying optimal models:

  1. Backward Stepwise: Start with a full-featured model and sequentially remove the least significant feature until every remaining feature is significant below a specified alpha.
  2. Forward-Backward Stepwise: Start with an empty model and sequentially add the most significant feature until no additional feature is significant under the specified alpha. Then run a backward stepwise pass to remove any previously added features that became insignificant as other features were added.
  3. Forward Stepwise w/ 5 Fold Cross Validation: Split the training set into 5 folds; each candidate model is trained on 4 folds and evaluated on the held-out fold to measure how well it generalizes to out-of-sample data. Starting from an empty model, we sequentially add the feature that yields the largest reduction in out-of-sample error (root mean squared error), stopping when no addition reduces that error. The selected features are then fit on the full training set, with a backward stepwise pass as needed to eliminate insignificant features.
  4. Backward Stepwise w/ 5 Fold Cross Validation: The mirror image of the forward stepwise counterpart: start with a full-featured model and sequentially drop the feature whose removal yields the largest reduction in out-of-sample error. The selected subset is then fit on the full training set, with a backward stepwise pass as needed to eliminate insignificant features.

Large Sample Model Assumptions

Typically, we would use sklearn's LinearRegression implementation for a problem such as this. However, while optimized for evaluating out-of-sample accuracy/R²/etc., sklearn does not provide a built-in coefficient testing framework. Therefore, we will build our own model class.
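A minimal sketch of such a class is below, assuming the large-sample setting named above so coefficient tests use a normal (z) reference rather than a t distribution. The class name and attribute names are hypothetical, not the notebook's actual code.

```python
import numpy as np
from scipy import stats

class OLSWithTests:
    """Minimal OLS fit with large-sample (z-based) coefficient tests."""

    def fit(self, X, y):
        X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
        y = np.asarray(y, dtype=float)
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        self.coef_ = XtX_inv @ X.T @ y            # beta = (X'X)^-1 X'y
        resid = y - X @ self.coef_
        sigma2 = resid @ resid / (n - p)          # unbiased residual variance
        self.se_ = np.sqrt(np.diag(sigma2 * XtX_inv))
        self.z_ = self.coef_ / self.se_
        # Two-sided p-values under the large-sample normal approximation
        self.pvalues_ = 2 * stats.norm.sf(np.abs(self.z_))
        return self

# Toy check: y = 3 + 2x + noise
rng = np.random.default_rng(1)
x = rng.normal(size=(500, 1))
y = 3.0 + 2.0 * x[:, 0] + rng.normal(scale=0.3, size=500)
model = OLSWithTests().fit(x, y)
```

Exposing `pvalues_` directly is what makes the stepwise procedures above straightforward to drive programmatically.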

Custom Linear Regression Model Class

Method 1: Backward Stepwise

Method 1 Test

Method 2: Forward-Backward Stepwise

Method 2 Test

Method 3: Forward Stepwise w/ 5 Fold Cross Validation

Method 3 Test

Method 4: Backward Stepwise w/ 5 Fold Cross Validation

Test specific research sub-question/hypothesis:

  1. A linear model built from individual ingredients (and their transformations), age, and ingredient-age interaction effects will perform worse, both at explaining effects on concrete strength and at predicting it, than a model built from ingredient ratios, age, and ratio-age interaction effects.

Null: Individual-ingredient model performance is equal to ratio-ingredient model performance.
Alternative: Individual-ingredient model performance is worse than ratio-ingredient model performance.
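One way to operationalize this comparison is a paired, one-sided test on per-fold cross-validated RMSE for the two feature sets. The sketch below uses toy data where the target is deliberately driven by a cement/water ratio (an assumption purely for illustration, not the project's dataset or its exact test).

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy data: strength proxy driven by the cement/water ratio
rng = np.random.default_rng(0)
n = 200
cement = rng.uniform(100, 500, n)
water = rng.uniform(120, 250, n)
y = 10.0 * cement / water + rng.normal(scale=1.0, size=n)

X_individual = np.column_stack([cement, water])   # individual-ingredient features
X_ratio = (cement / water).reshape(-1, 1)         # ratio feature

cv = KFold(n_splits=5, shuffle=True, random_state=0)
rmse_ind = -cross_val_score(LinearRegression(), X_individual, y,
                            cv=cv, scoring="neg_root_mean_squared_error")
rmse_ratio = -cross_val_score(LinearRegression(), X_ratio, y,
                              cv=cv, scoring="neg_root_mean_squared_error")

# One-sided paired t-test on per-fold RMSE:
# H0: equal performance; H1: individual-ingredient model is worse (higher RMSE)
t_stat, p_two_sided = stats.ttest_rel(rmse_ind, rmse_ratio)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
```

The same paired comparison applies to the models produced by Methods 5-12 below, with the individual-ingredient and ratio feature sets swapped in for the toy matrices.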

Method 5: Backwards Stepwise Approach only considering Individual Ingredients/Transformations, Age, and Ingredient Interaction Effects with Age as features

Method 6: Backwards Stepwise Approach only considering Ingredient Ratios, Age, and Ratio Interaction Effects with Age as features

Method 7: Forward-Backward Stepwise Approach only considering Individual Ingredients/Transformations, Age, and Ingredient Interaction Effects with Age as features

Method 8: Forward-Backward Stepwise Approach only considering Ingredient Ratios, Age, and Ratio Interaction Effects with Age as features

Method 9: Forward Stepwise Approach w/ 5 Fold Cross Validation only considering Individual Ingredients/Transformations, Age, and Ingredient Interaction Effects with Age as features

Method 10: Forward Stepwise Approach w/ 5 Fold Cross Validation only considering Ingredient Ratios, Age, and Ratio Interaction Effects with Age as features

Method 11: Backward Stepwise Approach w/ 5 Fold Cross Validation only considering Individual Ingredients/Transformations, Age, and Ingredient Interaction Effects with Age as features

Method 12: Backward Stepwise Approach w/ 5 Fold Cross Validation only considering Ingredient Ratios, Age, and Ratio Interaction Effects with Age as features